Mangla, Neha
- Analysis of Diabetic Dataset and Developing Prediction Model by using Hive and R
Authors
Affiliations
1 Atria Institute of Technology, Bangalore – 24, Karnataka, IN
2 Department of ISE, Atria Institute of Technology, Bangalore - 24, Karnataka, IN
Source
Indian Journal of Science and Technology, Vol 9, No 47 (2016)
Abstract
Objectives: Diabetes is one of the most serious diseases spreading worldwide; it is caused by hereditary factors and by poor diet. Analyzing the disease reveals facts about its symptoms, and these facts can be used to build a model that predicts diabetes. Such a model makes prediction easier and can benefit society; sharing the information extracted from the model with the government can also help it design welfare programs for citizens. Method and Analysis: We use the Pima Indian diabetes dataset, which has 768 samples. The dataset is first loaded into Hive to convert it into a formatted dataset, and a few queries are then run on it to extract useful information. Next, we use R to perform statistical analysis: generating graphs, calculating the Gini index, building the prediction model, and measuring the model's efficiency. Findings: Using Hive, we ran a few queries on the diabetes dataset, such as finding the distinct values in the table, which lets us analyze the table's attributes. Hive also reports the time taken by each query by default, which is one of its advantages. We then used R for statistical analysis. Because a picture conveys more than words, the graphs generated by R let us analyze the dataset more easily and quickly than reading through each row. Using R, we calculated the Gini index of the attributes to measure the inequality among their values. We also built a prediction model with the KNN algorithm and measured its accuracy.
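The two calculations named in the findings, a Gini index measuring inequality among an attribute's values and a KNN classifier scored by accuracy, can be sketched roughly as follows. This is only an illustration in Python rather than the R code the paper used; the function names, the choice of k, the Euclidean distance, and the toy values are all assumptions and are not taken from the paper or from the Pima dataset.

```python
import math
from collections import Counter


def gini_coefficient(values):
    """Gini coefficient of non-negative values (0 = all equal, near 1 = very unequal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority label among the k training rows closest to the query (Euclidean distance)."""
    nearest = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train_rows, train_labels, test_rows, test_labels, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_rows, train_labels, row, k) == label
        for row, label in zip(test_rows, test_labels)
    )
    return hits / len(test_rows)


# Made-up toy values for illustration only (not rows from the Pima dataset).
toy_rows = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
toy_labels = [0, 0, 0, 1, 1, 1]
print(gini_coefficient([row[0] for row in toy_rows]))
print(accuracy(toy_rows, toy_labels, [(1.0, 0.9), (5.0, 4.9)], [0, 1]))
```

In practice the attributes would usually be scaled before computing distances, since KNN is sensitive to differences in units between columns.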
All of this is done in R, which simplifies the method and makes it easy for a user to build the prediction model and calculate its efficiency. The prediction model also lets us count how many samples were predicted correctly. Improvements: The work could be improved by repeating these operations on a much larger dataset, for example one with millions of records. The model's efficiency is about 79%, which can be improved further.
Keywords
Big-Data, Gini Index, Hadoop, Hive, K Nearest Neighbor, R.
- EPH-Enhancement of Parallel Mining using Hadoop
Authors
Neha Mangla 1, K. Sushma 1
Affiliations
1 A.I.T., Bangalore, IN
Source
International Journal of Engineering Research, Vol 5, No SP 5 (2016), Pagination: 1009-1015
Abstract
Data is now generated at a tremendous rate, so handling it to gain useful insight has become a necessity; such data is valuable to researchers for analysis. Traditional systems cannot handle more than terabytes of data, because performance suffers and storage is very costly. Big data is an innovative approach to analyzing, storing, managing, distributing, and capturing datasets. To achieve compressed storage, we implement a parallel mining algorithm called Enhancement of Parallel Mining using Hadoop (EPH). Hadoop is a platform that enables distributed processing through the MapReduce programming model. This produces results quickly, and getting results in less time helps a business compete and grow. For the analysis in this paper, unstructured real-time datasets are converted to a structured format and processed with MapReduce. According to the literature, existing mining algorithms for real-time datasets lack fault tolerance, load balancing, data distribution, and automatic parallelization. To overcome these disadvantages, we implement MapReduce for association analysis. In EPH, we improve performance by distributing the load across the computing nodes. Our proposed solution uses real-world celestial spectral data. The paper presents a graphical comparison of a traditional system with Hadoop.
Keywords
Bigdata, Hadoop, Mapreduce, Parallel Mining, Association Analysis, Enhancement of Parallel Mining using Hadoop (EPH).
- Machine Learning Approach for Unstructured Data Using Hive
Authors
Affiliations
1 Atria Institute of Technology, Bangalore, IN
Source
International Journal of Engineering Research, Vol 5, No SP 4 (2016), Pagination: 801-807
Abstract
Voluminous structured, semi-structured, and unstructured datasets, which have the potential to reveal relationships among data in business, are being collected rapidly; this is termed big data. Storing such large volumes of data is difficult, because traditional data warehousing solutions, even at terabyte and petabyte scale, are insufficient and exorbitantly expensive. It is viable to store and process these vast amounts of data on Hadoop, a low-cost, reliable, scalable, and fault-tolerant Java-based programming framework that supports the processing of large datasets in a distributed computing environment. Hadoop implements the MapReduce programming model for storing and processing large datasets with a parallel, distributed algorithm on commodity hardware. However, the programming model expects developers to write bespoke programs that are inflexible, time-consuming, and hard to code, maintain, and reuse. HiveQL simplifies this challenging task of writing complex MapReduce code.
Hive is the platform required to run HiveQL. It is built on top of Hadoop to query big data. Internally, Hive queries are converted into corresponding MapReduce tasks.
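To make the query-to-MapReduce correspondence concrete, the sketch below shows how an aggregation such as `SELECT movie_id, AVG(rating) FROM ratings GROUP BY movie_id` breaks down into map, shuffle, and reduce steps. It is a conceptual illustration in plain Python, not Hive's actual execution plan; the table name, column names, and function names are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit one (key, value) pair per input row: (movie_id, rating)."""
    for user_id, movie_id, rating in rows:
        yield movie_id, rating


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into one aggregate: the mean rating."""
    return {key: sum(values) / len(values) for key, values in groups.items()}


def average_rating_per_movie(rows):
    """Chain the three phases, mirroring a GROUP BY with AVG."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Made-up (user_id, movie_id, rating) rows for illustration only.
toy_ratings = [(1, 10, 4.0), (2, 10, 2.0), (1, 20, 5.0)]
print(average_rating_per_movie(toy_ratings))
```

With HiveQL the developer writes only the one-line query; the grouping and aggregation logic spelled out here by hand is what would otherwise have to be coded as a bespoke MapReduce program.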
In this paper, a machine learning algorithm is used to build a movie rating prediction system based on the MovieLens dataset.